Abstract

AB-testing is a very popular technique in web companies since it makes it possible to 
accurately predict the impact of a modification with the simplicity of a random split across 
users. One of the critical aspects of an AB-test is its duration and it is important to reliably 
compute confidence intervals associated with the metric of interest to know when to stop 
the test. In this paper, we define a clean mathematical framework to model the AB-test 
process. We then propose three algorithms based on bootstrapping and on the central 
limit theorem to compute reliable confidence intervals which extend to other metrics than 
the common probabilities of success. They apply to both absolute and relative increments 
of the most used comparison metrics, including the number of occurrences of a particular 
event and a click-through rate implying a ratio. 

Keywords: AB-test, Confidence interval, Central limit theorem, Ratio of normal variables,
Bootstrapping

1. Introduction 

Evaluating complex web systems and their impact on user behavior is a challenge of growing
importance. Data-driven tools have become very popular in the last decades to help decide
which algorithm, which website home page, which user interface, etc., provides the best
results in terms of some relevant criteria such as the generated revenue, the click-through
rate (CTR), the number of visits, or any other business metric. A detailed description of
the general data-driven paradigm is available in Darema (2004).

Different experimentation methods are available (see Kaushik, 2006, for a primer), and
AB-testing, also known as split or bucket testing, is widespread. For examples and best practices, we
refer the reader to Crook et al. (2009); Kohavi et al. (2009, 2012) and references therein.
This method compares two versions, A and B, of a system by splitting the users randomly 
into two independent populations to which systems A and B are respectively applied. We 
use the word system in a broad sense here as it can range from being the design of a web 
page (Swanson, 2011) to more complex algorithms such as a bidder on a real time bidding 
ad server (Zhang et al., 2014). Relevant metrics are then computed on each population and 
compared to decide which system performs better. 

Such comparisons rely on statistical tests to evaluate their significance, see for example
Crocker and Algina (1986); Keppel (1991), among which Z-tests assess whether the null
hypothesis can be rejected at a fixed level of certainty. The simplest example is the
one measuring a click-through rate, or any other rate built from binary values.
The click-through rate can be written as the empirical average of Bernoulli random variables
equal to 1 if the user has clicked and to 0 otherwise. Then, the central limit theorem
provides confidence intervals for both the click-through rate in each population and its
absolute increment between the two populations (see Amazon, 2010, for an example). In this
case, the asymptotic variance is directly derived from the estimated click-through rate $p$ as
$p(1 - p)/n$ where $n$ is the number of users.
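
As an illustration of the Bernoulli case just described, the following minimal sketch computes a normal-approximation confidence interval from the variance $p(1 - p)/n$. The function name and the example counts are hypothetical and are not taken from the paper.

```python
import math
from statistics import NormalDist


def bernoulli_rate_ci(clicks, users, level=0.95):
    """Normal-approximation CI for a rate estimated from 0/1 outcomes.

    Assumes each user contributes one Bernoulli outcome (clicked or not).
    """
    p = clicks / users
    # Two-sided standard normal quantile, e.g. about 1.96 for level=0.95.
    s = NormalDist().inv_cdf((1 + level) / 2)
    half_width = s * math.sqrt(p * (1 - p) / users)
    return p - half_width, p + half_width


# Hypothetical counts, for illustration only.
low, high = bernoulli_rate_ci(clicks=440, users=10_000)
```

This interval is only valid when each user produces a single binary outcome, which is exactly the restriction the rest of the paper removes.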

In practice, a user might click several times. Then the random variables that are
averaged no longer follow a Bernoulli law and the asymptotic variance cannot
be computed in the same way. We show through a numerical application to CTR that using
such an approximation can even be dangerous. As stated in Kohavi et al. (2009),
we need to use the variance of the number of clicks per user. They also provide confidence
intervals for their relative increment using an approximation for the ratio adapted from
Willan and Briggs (2006) but estimators for the involved variances are not provided for
non-Bernoulli random variables. Furthermore, these confidence intervals do not take into
account the randomness of the number of displays made to users.

The literature lacks a formal model of the AB-test process. Previous works such as
Crook et al. (2009); Kohavi et al. (2009, 2012) mainly focus on applications of this method 
and do not provide a well-defined statistical framework for the results’ analysis. Most 
available sources for the practitioner are online calculators only dedicated to the Bernoulli 
case. A primer of the underlying theory applied to AB-test analysis is only given in online 
references such as Amazon (2010) but they do not go deeply into the statistical modeling 
and do not cover more general metrics than simple sums of independent Bernoulli random 
variables. In this paper, we introduce a formal framework for the AB-test process modeling 
only involving assumptions consistent with the data-driven paradigm. It allows us to prove 
some statistical properties of the involved estimators, including those based on ratios, and 
to get numerical methods to approximate the variances involved in the related central limit 
theorems. We also go beyond that by justifying the use of the bootstrap algorithm (Efron 
and Tibshirani, 1993) to compute confidence intervals for absolute and relative increments. 

The mathematical formalization of the AB-test framework is given in Section 2. In 
Section 3, we provide exact asymptotic confidence intervals for any kind of metric that is 
obtained by summing quantities over the users, and for any metric computed as the ratio 
of such sums. We also get exact asymptotic confidence intervals for both their absolute 
and relative increments under few assumptions, most of them directly related to the AB- 
test process. Explicit estimators for the related asymptotic variances are provided. We 
additionally show how to use bootstrapping to get confidence intervals when the data cannot
be grouped by user, as is commonly the case in the big-data field. Section 4 numerically 
validates our assumptions and the proposed algorithms, while Appendices A and B give 
formal proofs of the technical results of Section 3. 

2. Mathematical Formulation of the AB-test Process 

In order to translate the AB-test process into a mathematical framework, we introduce some
random variables modeling the metrics that one wants to evaluate and the way in which
the users are separated into two populations.

More precisely, let $(\Omega, \mathcal{F}, \mathbb{P})$ be a probability space and $\mathbb{E}[\cdot]$ the expectation operator
under $\mathbb{P}$. We define a sequence of random vectors on $\mathbb{R}^4 \times \{0,1\}^2$:

$$\left(X_i^A, Y_i^A, X_i^B, Y_i^B, \varepsilon_i^A, \varepsilon_i^B\right)_{i \ge 1} .$$

For each user $i \ge 1$, $\varepsilon_i^A$ and $\varepsilon_i^B$ indicate the population that has been selected for this user:
$\varepsilon_i^A = 1$ (resp. $\varepsilon_i^B = 1$) if and only if the user $i$ is in population A of size ratio $\alpha_A \in [0,1]$
(resp. B of size ratio $\alpha_B \in [0,1]$). Note that in general we will have $\alpha_A + \alpha_B = 1$ but this
is not required and our analysis also applies to tests involving more than two populations.
The other variables model metrics of interest for the AB-tester. $X_i^A$ and $X_i^B$ are the same
metric generated by the user $i$ if systems A and B, respectively, were applied to that user. The same
stands for $Y_i^A$ and $Y_i^B$, which model another metric.


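Example 1 (Comparison of revenue) When the AB-tester wants to compare the revenue
generated by algorithms A and B, they compare the total revenue of each population,
normalized by its size ratio. These quantities can be written:

$$\frac{1}{\alpha_A} \sum_{i \mid \varepsilon_i^A = 1} X_i^A \quad \text{and} \quad \frac{1}{\alpha_B} \sum_{i \mid \varepsilon_i^B = 1} X_i^B$$

if $X_i^A$ and $X_i^B$ are the revenues generated by user $i$ under systems A and B respectively.
Note that, in practice, we can also normalize the total revenues by the real population sizes
instead of their ratios and the quantities to compare become:

$$\frac{\sum_{i \mid \varepsilon_i^A = 1} X_i^A}{\sum_{i \mid \varepsilon_i^A = 1} 1} \quad \text{and} \quad \frac{\sum_{i \mid \varepsilon_i^B = 1} X_i^B}{\sum_{i \mid \varepsilon_i^B = 1} 1} .$$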

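Example 2 (Comparison of CTR) When the AB-tester wants to compare the CTR generated
by algorithms A and B, they compare the CTR of each population. These can be
written:

$$\frac{\sum_{i \mid \varepsilon_i^A = 1} X_i^A}{\sum_{i \mid \varepsilon_i^A = 1} Y_i^A} \quad \text{and} \quad \frac{\sum_{i \mid \varepsilon_i^B = 1} X_i^B}{\sum_{i \mid \varepsilon_i^B = 1} Y_i^B}$$

if $X_i^A$ and $X_i^B$ are the clicks generated by user $i$, and $Y_i^A$ and $Y_i^B$ the number of displays
shown to the same user under systems A and B respectively.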


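We introduce the following assumptions, which are easily satisfied in an AB-test setting.

A1 The random vectors $\left(X_i^A, Y_i^A, X_i^B, Y_i^B, \varepsilon_i^A, \varepsilon_i^B\right)_{i \ge 1}$ are independent and identically
distributed.

A2 The random vectors $\left(X_1^A, Y_1^A, X_1^B, Y_1^B\right)$ and $\left(\varepsilon_1^A, \varepsilon_1^B\right)$ are independent.

A3 The random variables $\left(X_1^A, Y_1^A, X_1^B, Y_1^B\right)$ are $L^2$-integrable and we define

$$m_{X^A} \overset{\mathrm{def}}{=} \mathbb{E}\left[X_1^A\right], \quad m_{Y^A} \overset{\mathrm{def}}{=} \mathbb{E}\left[Y_1^A\right], \quad m_{X^B} \overset{\mathrm{def}}{=} \mathbb{E}\left[X_1^B\right], \quad m_{Y^B} \overset{\mathrm{def}}{=} \mathbb{E}\left[Y_1^B\right], \tag{1}$$

$$\sigma^2_{X^A} \overset{\mathrm{def}}{=} \mathrm{Var}\left(X_1^A\right), \quad \sigma^2_{Y^A} \overset{\mathrm{def}}{=} \mathrm{Var}\left(Y_1^A\right), \quad \sigma^2_{X^B} \overset{\mathrm{def}}{=} \mathrm{Var}\left(X_1^B\right), \quad \sigma^2_{Y^B} \overset{\mathrm{def}}{=} \mathrm{Var}\left(Y_1^B\right), \tag{2}$$

$$\rho_A \overset{\mathrm{def}}{=} \frac{\mathrm{Cov}\left(X_1^A, Y_1^A\right)}{\sigma_{X^A} \sigma_{Y^A}}, \quad \rho_B \overset{\mathrm{def}}{=} \frac{\mathrm{Cov}\left(X_1^B, Y_1^B\right)}{\sigma_{X^B} \sigma_{Y^B}} . \tag{3}$$

A4 The random variables $\left(X_1^A, Y_1^A, X_1^B, Y_1^B\right)$ are almost surely non-negative and not almost
surely zero, that is

$$\mathbb{P}\{X_1^A < 0\} = 0, \quad \mathbb{P}\{X_1^A > 0\} > 0,$$
$$\mathbb{P}\{Y_1^A < 0\} = 0, \quad \mathbb{P}\{Y_1^A > 0\} > 0,$$
$$\mathbb{P}\{X_1^B < 0\} = 0, \quad \mathbb{P}\{X_1^B > 0\} > 0,$$
$$\mathbb{P}\{Y_1^B < 0\} = 0, \quad \mathbb{P}\{Y_1^B > 0\} > 0.$$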

A5 The random variables $\left(\varepsilon_1^A, \varepsilon_1^B\right)$ satisfy:

1. $\varepsilon_1^A$ and $\varepsilon_1^B$ follow Bernoulli laws of respective parameters $\alpha_A$ and $\alpha_B$.

2. $\varepsilon_1^A \varepsilon_1^B = 0$.


A user can only be assigned to one population, which is ensured by Assumption A5-2.
Assumption A5-1 sets the ratio of populations A and B to be respectively $\alpha_A$ and $\alpha_B$.

Assumption A2 reflects the fact that the population attribution process does not affect 
the user reaction to the applied system while Assumption A3 is purely technical. This is 
the only assumption that is not implied by the AB-test process but it will guarantee the 
convergence of the estimators. Assumption A4 is consistent with the metrics that we are 
studying. They will typically be zero with a high probability and positive otherwise (for 
example, the number of clicks). 

Finally, Assumption A1 models the un-identifiability of the users. They are all independent
and, without prior knowledge, identically distributed. The whole AB-test process
relies on this assumption by randomly splitting the users into two populations.

It is worthwhile to note that the metrics of interest $\left(X_i^A, Y_i^A, X_i^B, Y_i^B\right)_{i \ge 1}$ are defined for
each user and for each system, independently of the population split. The AB-test process
will give access to only $X_i^A$ or $X_i^B$ for a given user $i$, but they can still both be defined
even when they are not observed. This is the main interest of this modeling, which allows us
to write those variables independently of the population. Furthermore, we circumvent the
issue of having hidden variables by introducing a new set of variables that will always be
observed. To that purpose, we simply set $X_i^A$ to 0 when it is not observed, i.e. when the
user $i$ is not in population A. This is formalized in the following definition.


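Definition 1 For each user $i \ge 1$, we define

$$\bar{X}_i^A \overset{\mathrm{def}}{=} \frac{\varepsilon_i^A X_i^A}{\alpha_A}, \quad \bar{Y}_i^A \overset{\mathrm{def}}{=} \frac{\varepsilon_i^A Y_i^A}{\alpha_A}, \quad \bar{X}_i^B \overset{\mathrm{def}}{=} \frac{\varepsilon_i^B X_i^B}{\alpha_B}, \quad \bar{Y}_i^B \overset{\mathrm{def}}{=} \frac{\varepsilon_i^B Y_i^B}{\alpha_B} .$$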

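Remark 2 We trivially obtain from Assumption A1 that the random vectors $\left(\bar{X}_i^A, \bar{Y}_i^A, \bar{X}_i^B, \bar{Y}_i^B\right)_{i \ge 1}$
are independent and identically distributed.

Using Definition 1, sums of the form

$$\frac{1}{\alpha_A} \sum_{i \mid \varepsilon_i^A = 1} X_i^A$$

can be rewritten in a more appealing way as

$$\sum_{i=1}^{n} \bar{X}_i^A ,$$

where the random variables $\bar{X}_i^A$ are summed over all the users independently of their
population, which leads to the following sum definitions for any number of users $n \in \mathbb{N}$:

$$S_n^{X^A} \overset{\mathrm{def}}{=} \frac{1}{n} \sum_{i=1}^{n} \bar{X}_i^A, \quad S_n^{Y^A} \overset{\mathrm{def}}{=} \frac{1}{n} \sum_{i=1}^{n} \bar{Y}_i^A, \quad S_n^{X^B} \overset{\mathrm{def}}{=} \frac{1}{n} \sum_{i=1}^{n} \bar{X}_i^B, \quad S_n^{Y^B} \overset{\mathrm{def}}{=} \frac{1}{n} \sum_{i=1}^{n} \bar{Y}_i^B . \tag{4}$$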


In the case of Example 1, we will have to compare either two sums over the same indices,
$S_n^{X^A}$ and $S_n^{X^B}$, when the normalization is done by the population ratios; or two ratios of
sums over the same indices, $S_n^{X^A}/S_n^{Y^A}$ and $S_n^{X^B}/S_n^{Y^B}$, where $Y_i^A = 1$ and $Y_i^B = 1$, when
the normalization is done by the real population sizes. In the case of Example 2, the ratios
to compare become similarly $S_n^{X^A}/S_n^{Y^A}$ and $S_n^{X^B}/S_n^{Y^B}$.
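
The rewriting above can be sketched in a few lines: every user contributes to all four sums, with a zero contribution to the population they are not in. This is a minimal sketch assuming one record per user and that every user is in A or B; the function name and the toy records are hypothetical.

```python
def normalized_sums(users, alpha_a, alpha_b):
    """Compute (S_XA, S_YA, S_XB, S_YB) as in (4).

    users: list of (population, x, y) with population in {"A", "B"},
    one entry per user (an assumption of this sketch).
    """
    n = len(users)
    totals = {"A": [0.0, 0.0], "B": [0.0, 0.0]}
    for pop, x, y in users:
        # A user adds x / alpha and y / alpha to its own population only,
        # which is the same as adding 0 to the other population.
        totals[pop][0] += x
        totals[pop][1] += y
    s_xa = totals["A"][0] / (alpha_a * n)
    s_ya = totals["A"][1] / (alpha_a * n)
    s_xb = totals["B"][0] / (alpha_b * n)
    s_yb = totals["B"][1] / (alpha_b * n)
    return s_xa, s_ya, s_xb, s_yb


# Toy data: (population, clicks, displays).
toy_users = [("A", 1, 3), ("B", 0, 1), ("A", 0, 2), ("B", 2, 4)]
s_xa, s_ya, s_xb, s_yb = normalized_sums(toy_users, 0.5, 0.5)
ctr_a = s_xa / s_ya  # equals total clicks / total displays in population A
```

The normalizing factors $\alpha_A n$ cancel in the ratio, so the ratio of sums coincides with the per-population CTR of Example 2.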

Writing the estimators this way validates the use of the bootstrap technique (Efron
and Tibshirani, 1993) to get confidence intervals. For the relative increments of the metrics
of interest, this can be done through the study of the ratios:

$$\frac{S_n^{X^B}}{S_n^{X^A}} \quad \text{and} \quad \frac{S_n^{X^B} / S_n^{Y^B}}{S_n^{X^A} / S_n^{Y^A}} . \tag{5}$$

Three algorithms will be derived in the following section to get confidence intervals on such
quantities.


3. Estimator Convergence and Algorithms for Confidence Intervals 

The previous modeling has been designed to translate AB-test metrics into functions of 
sums of i.i.d. variables as in (5). The i.i.d. property allows us to design and validate a 
bootstrap technique to get confidence intervals, and dealing only with sums adds the ability 
to derive central limit theorems for all the metrics and their increments (both absolute and 
relative). 

3.1 Confidence Interval Computation 

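According to Remark 2, the random vectors $\left(\bar{X}_i^A, \bar{Y}_i^A, \bar{X}_i^B, \bar{Y}_i^B\right)_{i \ge 1}$ are i.i.d., and by
Definition 1 we have for $i \ge 1$

$$\left|\bar{X}_i^A\right|^2 \le \frac{1}{\alpha_A^2} \left|X_i^A\right|^2, \quad \left|\bar{Y}_i^A\right|^2 \le \frac{1}{\alpha_A^2} \left|Y_i^A\right|^2, \quad \left|\bar{X}_i^B\right|^2 \le \frac{1}{\alpha_B^2} \left|X_i^B\right|^2, \quad \left|\bar{Y}_i^B\right|^2 \le \frac{1}{\alpha_B^2} \left|Y_i^B\right|^2 .$$

Assumption A3 then shows that $\left(\bar{X}_i^A, \bar{Y}_i^A, \bar{X}_i^B, \bar{Y}_i^B\right)_{i \ge 1}$ are $L^2$-integrable, and hence integrable. We can thus apply
the law of large numbers to the sums of interest $\left(S_n^{X^A}, S_n^{Y^A}, S_n^{X^B}, S_n^{Y^B}\right)$ and show that they
converge to $\left(m_{X^A}, m_{Y^A}, m_{X^B}, m_{Y^B}\right)$. We then get that for any continuous function $f$, the
quantity $f\left(S_n^{X^A}, S_n^{Y^A}, S_n^{X^B}, S_n^{Y^B}\right)$ is a consistent estimator of $f\left(m_{X^A}, m_{Y^A}, m_{X^B}, m_{Y^B}\right)$.

Estimator | $f(x, y, x', y')$ | $F(D)$ | CLT
------------------------------------------------------------------------------------------
$S_n^{X^B} - S_n^{X^A}$ | $x' - x$ | $m_{X^B} - m_{X^A}$ | Prop. 5
$S_n^{X^B} / S_n^{X^A}$ | $x' / x$ | $m_{X^B} / m_{X^A}$ | Prop. 6
$S_n^{X^B} / S_n^{Y^B} - S_n^{X^A} / S_n^{Y^A}$ | $x'/y' - x/y$ | $m_{X^B} / m_{Y^B} - m_{X^A} / m_{Y^A}$ | Prop. 8
$\left(S_n^{X^B} / S_n^{Y^B}\right) / \left(S_n^{X^A} / S_n^{Y^A}\right)$ | $(x'/y') / (x/y)$ | $\left(m_{X^B} / m_{Y^B}\right) / \left(m_{X^A} / m_{Y^A}\right)$ | Prop. 9

Table 1: Different estimators of interest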

The case of a ratio is dealt with by introducing the following transformation.


Definition 3 We define the function $\varphi$ from $\mathbb{R}$ to $\mathbb{R}^*$ by

$$\forall x \in \mathbb{R}, \quad \varphi(x) = \begin{cases} 1, & \text{if } x = 0, \\ x, & \text{if } x \neq 0. \end{cases}$$

We will apply $\varphi$ to all the denominators in the following theorems, and, according to the
positiveness ensured by Assumption A4, the ratios are continuous functions of the non-zero
sums. It is only a technical point, as in practice we would not define the ratio for a null
denominator. In theoretical applications, Lemma 10 in Appendix A allows us to replace
the sums by their non-zero versions obtained by applying the operator $\varphi$, but for the sake
of simplicity we will not use it when describing the bootstrap.

If we denote by $D$ the distribution of $\left(\bar{X}_1^A, \bar{Y}_1^A, \bar{X}_1^B, \bar{Y}_1^B\right)$, then all the quantities that we
are estimating can be written as a functional $F(D) \overset{\mathrm{def}}{=} f\left(m_{X^A}, m_{Y^A}, m_{X^B}, m_{Y^B}\right)$, and their
estimators are asymptotically normal as shown in the relevant Propositions of Section 3.2.
The link between estimators, $f$, $F$, and their central limit theorem result is summarized in
Table 1.

Bootstrapping In this specific framework, bootstrapping can be used by randomly selecting
$n$ users (possibly picking the same user several times) and computing the estimator
with this random set of users. Repeating this $M$ times provides an empirical distribution
of the estimator of $F(D)$. The $M$ estimator values can be computed with only one pass on
the dataset using an online version of bootstrapping described in Oza and Russell (2001);
Oza (2005).

For each user $i$, a Poisson random variable $Z_i$ is simulated and the current user is
included $Z_i$ times. The full procedure is detailed in Algorithm 1 and works well even if the
dataset is not grouped by user. In this case, each line $l$ of the dataset is associated to a user
$i = I_l$ and contains a vector $\left(\varepsilon_i^A x_l^A, \varepsilon_i^A y_l^A, \varepsilon_i^B x_l^B, \varepsilon_i^B y_l^B\right)$ such that for any $i \ge 1$

$$X_i^A = \sum_{l \mid I_l = i} x_l^A, \quad Y_i^A = \sum_{l \mid I_l = i} y_l^A, \quad X_i^B = \sum_{l \mid I_l = i} x_l^B, \quad Y_i^B = \sum_{l \mid I_l = i} y_l^B .$$

It relies on a pseudo-random generator that is able to generate $M$ Poisson variables $\left(Z_i^m\right)_{1 \le m \le M}$
for each user $i$.

Algorithm 1 Online bootstrapping

1: Inputs: A dataset $\left(I_l, \varepsilon_{I_l}^A x_l^A, \varepsilon_{I_l}^A y_l^A, \varepsilon_{I_l}^B x_l^B, \varepsilon_{I_l}^B y_l^B\right)_{1 \le l \le L}$ and Poisson random variables $\left(Z_i^m\right)_{1 \le m \le M}$ for each user $i$.
2: Initialization: set $T_{k,m} = 0$ for $k = 1, \dots, 4$ and $1 \le m \le M$.
3: Loop on the data set:
4: for $l$ from 1 to $L$ do
5:     Set $i = I_l$.
6:     for $m$ from 1 to $M$ do
7:         Set $T_{1,m} = T_{1,m} + Z_i^m \varepsilon_i^A x_l^A / \alpha_A$
8:         Set $T_{2,m} = T_{2,m} + Z_i^m \varepsilon_i^A y_l^A / \alpha_A$
9:         Set $T_{3,m} = T_{3,m} + Z_i^m \varepsilon_i^B x_l^B / \alpha_B$
10:        Set $T_{4,m} = T_{4,m} + Z_i^m \varepsilon_i^B y_l^B / \alpha_B$
11:    end for
12: end for
13: Computation of the estimators:
14: for $m$ from 1 to $M$ do
15:     Set $n_m = \sum_{i=1}^{n} Z_i^m$.
16:     Set $\widehat{F}_m = f\left(T_{1,m}/n_m, T_{2,m}/n_m, T_{3,m}/n_m, T_{4,m}/n_m\right)$.
17: end for
18: Outputs: $\left(\widehat{F}_m\right)_{m=1}^{M}$.
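
The one-pass procedure of Algorithm 1 can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes Poisson weights of parameter 1 (the usual choice in online bootstrapping), and it caches each user's weights in a dictionary instead of regenerating them from a per-user seeded generator. The names `poisson1` and `online_bootstrap` are hypothetical.

```python
import math
import random


def poisson1(rng):
    """Draw a Poisson(1) variable with Knuth's multiplication method."""
    threshold = math.exp(-1.0)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def online_bootstrap(rows, alpha_a, alpha_b, f, num_bootstraps, seed=0):
    """One pass over rows of (user_id, population, x, y).

    A user may span several rows; all of its rows share the same M weights.
    Returns the bootstrap replicates of f(S_XA, S_YA, S_XB, S_YB).
    """
    rng = random.Random(seed)
    weights = {}  # user_id -> list of M Poisson weights, drawn once per user
    totals = [[0.0] * 4 for _ in range(num_bootstraps)]
    counts = [0] * num_bootstraps  # n_m: sum of the weights over users
    for user_id, pop, x, y in rows:
        if user_id not in weights:
            z = [poisson1(rng) for _ in range(num_bootstraps)]
            weights[user_id] = z
            for m in range(num_bootstraps):
                counts[m] += z[m]
        z = weights[user_id]
        alpha, offset = (alpha_a, 0) if pop == "A" else (alpha_b, 2)
        for m in range(num_bootstraps):
            totals[m][offset] += z[m] * x / alpha
            totals[m][offset + 1] += z[m] * y / alpha
    # Replicates with n_m = 0 (no user drawn) are skipped.
    return [f(*(t / counts[m] for t in totals[m]))
            for m in range(num_bootstraps) if counts[m] > 0]
```

The dictionary costs memory proportional to the number of users; replacing it with a generator seeded by the user id, as the text suggests, would trade that memory for recomputation, but the count $n_m$ would then need a separate way of visiting each user exactly once.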

Confidence interval algorithms The $M$ estimators $\left(\widehat{F}_m\right)_{m=1}^{M}$ obtained in Algorithm 1
can then be used to derive empirical quantiles and obtain confidence intervals with
Algorithm 2. However, quantile approximation for accurate confidence intervals requires $M$ to
be big enough and Algorithm 2 is only feasible if the number of users $n$ is small enough.

Another way of computing confidence intervals is to use one of the central limit theorems
stated in Section 3.2 on the condition that the implied variances can be easily estimated
from the data. The resulting algorithm is given in Algorithm 3 where we use the normal
cumulative distribution function $N$ defined by

$$\forall x \in \mathbb{R}, \quad N(x) \overset{\mathrm{def}}{=} \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-t^2/2} \, \mathrm{d}t . \tag{6}$$

Algorithm 2 Confidence interval with bootstrapping

1: Inputs: The bootstrap distribution $\left(\widehat{F}_m\right)_{m=1}^{M}$ given by Algorithm 1 and a confidence level $q$.
2: Compute $F_{\min}$ as the empirical quantile of $\left(\widehat{F}_m\right)_{m=1}^{M}$ of order $(1 - q)/2$.
3: Compute $F_{\max}$ as the empirical quantile of $\left(\widehat{F}_m\right)_{m=1}^{M}$ of order $(1 + q)/2$.
4: Outputs: $\left[F_{\min}, F_{\max}\right]$.
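
A minimal sketch of Algorithm 2 follows. The paper does not specify the empirical quantile convention, so linear interpolation between order statistics is an assumption of this sketch, as is the function name.

```python
def bootstrap_percentile_ci(replicates, level):
    """Percentile interval from bootstrap replicates (Algorithm 2).

    Quantiles use linear interpolation between order statistics,
    which is one convention among several.
    """
    ordered = sorted(replicates)

    def quantile(p):
        pos = p * (len(ordered) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(ordered) - 1)
        return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])

    return quantile((1 - level) / 2), quantile((1 + level) / 2)
```

With only a handful of replicates the extreme quantiles are driven by one or two order statistics, which is why this approach needs a large $M$.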


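Algorithm 3 Confidence interval with CLT

1: Inputs: $\left(\bar{X}_i^A, \bar{Y}_i^A, \bar{X}_i^B, \bar{Y}_i^B\right)_{i=1}^{n}$ and a confidence level $q$.
2: Set $s = N^{-1}\left(\frac{1 + q}{2}\right)$.
3: Estimate the asymptotic variance $\sigma_n^2$ using the relevant Proposition (see Table 1).
4: Outputs: $\left[ f\left(S_n^{X^A}, S_n^{Y^A}, S_n^{X^B}, S_n^{Y^B}\right) - \frac{s \sigma_n}{\sqrt{n}}, \; f\left(S_n^{X^A}, S_n^{Y^A}, S_n^{X^B}, S_n^{Y^B}\right) + \frac{s \sigma_n}{\sqrt{n}} \right]$.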

In practice, the data is not aggregated by user and we have to do so as a first step
in order to get the vectors $\left(\bar{X}_i^A, \bar{Y}_i^A, \bar{X}_i^B, \bar{Y}_i^B\right)_{i=1}^{n}$ and estimate the related variances and
covariances. This can be quite costly as it requires more than one reading of the dataset if
a user can be found in several lines. In the case where each user appears only once, this
will be the quickest algorithm as it does not need any simulation.

We can take advantage of both Algorithms 2 and 3 by using bootstrapping to approximate
the estimator variance and the asymptotic normality to derive confidence intervals as
described in Algorithm 4. The variance estimation only requires a small number of bootstraps
$M$ and the dataset is read only once. This algorithm will be shown in Section 4 to perform
better than Algorithm 2 for a given computational cost. However, this algorithm relies on
an asymptotic regime and is relevant only when the number of users $n$ is large enough.
Otherwise, pure bootstrapping may be a better alternative as it works for any value of $n$.


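Algorithm 4 Confidence interval with bootstrapping and CLT

1: Inputs: The bootstrap distribution $\left(\widehat{F}_m\right)_{m=1}^{M}$ given by Algorithm 1 and a confidence level $q$.
2: Set $s = N^{-1}\left(\frac{1 + q}{2}\right)$.
3: Set $\left(\sigma_M^F\right)^2 = \frac{1}{M - 1} \sum_{m=1}^{M} \left( \widehat{F}_m - \frac{1}{M} \sum_{p=1}^{M} \widehat{F}_p \right)^2$.
4: Outputs: $\left[ f\left(S_n^{X^A}, S_n^{Y^A}, S_n^{X^B}, S_n^{Y^B}\right) - s \sigma_M^F, \; f\left(S_n^{X^A}, S_n^{Y^A}, S_n^{X^B}, S_n^{Y^B}\right) + s \sigma_M^F \right]$.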
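
Algorithm 4 can be sketched as below: the bootstrap replicates are used only to estimate a standard deviation, and the interval is then built from the normal quantile around the point estimate. The function name is hypothetical; `statistics.stdev` uses the $1/(M-1)$ normalisation of step 3.

```python
from statistics import NormalDist, stdev


def bootstrap_clt_ci(point_estimate, replicates, level):
    """Normal interval whose width comes from the bootstrap spread (Algorithm 4)."""
    s = NormalDist().inv_cdf((1 + level) / 2)  # s = N^{-1}((1 + q) / 2)
    sigma = stdev(replicates)  # sample standard deviation with 1 / (M - 1)
    return point_estimate - s * sigma, point_estimate + s * sigma
```

Because only a second moment is estimated, a small number of replicates can already give a usable width, whereas extreme empirical quantiles need many more.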
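3.2 Central Limit Theorem

We now check that the estimators given in Table 1 all satisfy a central limit theorem. For
improved readability, proofs have been postponed to Appendix B.

Theorem 4 (Central limit theorem) Under Assumptions A1-A5, the vector $\left(S_n^{X^A}, S_n^{Y^A}, S_n^{X^B}, S_n^{Y^B}\right)$,
defined in (4), satisfies the following central limit theorem

$$\sqrt{n} \begin{pmatrix} S_n^{X^A} - m_{X^A} \\ S_n^{Y^A} - m_{Y^A} \\ S_n^{X^B} - m_{X^B} \\ S_n^{Y^B} - m_{Y^B} \end{pmatrix} \xrightarrow[n \to \infty]{\mathcal{D}} \mathcal{N}\left( \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \end{pmatrix}, \Sigma \right) ,$$

where $\Sigma$ is the covariance matrix of $\left(\bar{X}_1^A, \bar{Y}_1^A, \bar{X}_1^B, \bar{Y}_1^B\right)$ defined by the
variances

$$\sigma^2_{\bar{X}^A} \overset{\mathrm{def}}{=} \mathrm{Var}\left(\bar{X}_1^A\right) = \frac{1}{\alpha_A} \sigma^2_{X^A} + \frac{1 - \alpha_A}{\alpha_A} m^2_{X^A} , \tag{7}$$

$$\sigma^2_{\bar{Y}^A} \overset{\mathrm{def}}{=} \mathrm{Var}\left(\bar{Y}_1^A\right) = \frac{1}{\alpha_A} \sigma^2_{Y^A} + \frac{1 - \alpha_A}{\alpha_A} m^2_{Y^A} , \tag{8}$$

$$\sigma^2_{\bar{X}^B} \overset{\mathrm{def}}{=} \mathrm{Var}\left(\bar{X}_1^B\right) = \frac{1}{\alpha_B} \sigma^2_{X^B} + \frac{1 - \alpha_B}{\alpha_B} m^2_{X^B} , \tag{9}$$

$$\sigma^2_{\bar{Y}^B} \overset{\mathrm{def}}{=} \mathrm{Var}\left(\bar{Y}_1^B\right) = \frac{1}{\alpha_B} \sigma^2_{Y^B} + \frac{1 - \alpha_B}{\alpha_B} m^2_{Y^B} , \tag{10}$$

the covariances inside each population

$$\mathrm{Cov}\left(\bar{X}_1^A, \bar{Y}_1^A\right) = \frac{1}{\alpha_A} \rho_A \sigma_{X^A} \sigma_{Y^A} + \frac{1 - \alpha_A}{\alpha_A} m_{X^A} m_{Y^A} \overset{\mathrm{def}}{=} \bar{\rho}_A \sigma_{\bar{X}^A} \sigma_{\bar{Y}^A} , \tag{11}$$

$$\mathrm{Cov}\left(\bar{X}_1^B, \bar{Y}_1^B\right) = \frac{1}{\alpha_B} \rho_B \sigma_{X^B} \sigma_{Y^B} + \frac{1 - \alpha_B}{\alpha_B} m_{X^B} m_{Y^B} \overset{\mathrm{def}}{=} \bar{\rho}_B \sigma_{\bar{X}^B} \sigma_{\bar{Y}^B} , \tag{12}$$

and the cross population covariances

$$\mathrm{Cov}\left(\bar{X}_1^A, \bar{X}_1^B\right) = -m_{X^A} m_{X^B} ,$$
$$\mathrm{Cov}\left(\bar{X}_1^A, \bar{Y}_1^B\right) = -m_{X^A} m_{Y^B} ,$$
$$\mathrm{Cov}\left(\bar{Y}_1^A, \bar{X}_1^B\right) = -m_{Y^A} m_{X^B} ,$$
$$\mathrm{Cov}\left(\bar{Y}_1^A, \bar{Y}_1^B\right) = -m_{Y^A} m_{Y^B} .$$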

The convergence is done at rate $\sqrt{n}$ where $n$ is the total number of users, and not the
number of users in a population. However, the variance of each estimator decreases with
its relative population size thanks to factors $\alpha_A$ and $\alpha_B$ found in the denominators of the
four variances.

Furthermore, these variances are composed of two terms. One comes purely from
the variance of the metrics of interest (e.g. $\sigma^2_{X^A}$) and the other is added by the AB-test
process which randomly attributes each user to a population (e.g. $(1 - \alpha_A) m^2_{X^A}$). They
can be understood when looking at extreme cases. When population A includes all the
users, i.e. $\alpha_A = 1$, the randomness of the AB-test process disappears and we simply get
$\mathrm{Var}\left(\bar{X}_1^A\right) = \sigma^2_{X^A}$. On the other hand, if the metric of interest is purely deterministic,
let’s say $X_1^A = 1$ in which case we are interested in the number of users per population,
then the variance becomes $(1 - \alpha_A)/\alpha_A$, which is the variance of $\varepsilon_1^A/\alpha_A$. However, in practice, we
often have $\sigma^2_{X^A} \gg m^2_{X^A}$ and the second term becomes almost negligible.
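
The two-term decomposition in (7) can be checked exactly on a small discrete example. The sketch below enumerates the joint law of $(\varepsilon, X)$ under Assumption A2 and compares the direct variance of $\varepsilon X / \alpha$ with the formula; the distribution and the value of $\alpha$ are hypothetical.

```python
from fractions import Fraction as Fr


def scaled_variance_direct(dist, alpha):
    """Var(eps * X / alpha), enumerating eps in {0, 1} independently of X."""
    first = second = Fr(0)
    for x, p in dist.items():
        for eps, p_eps in ((1, alpha), (0, 1 - alpha)):
            value = Fr(eps * x) / alpha
            first += p * p_eps * value
            second += p * p_eps * value ** 2
    return second - first ** 2


def scaled_variance_formula(dist, alpha):
    """Right-hand side of (7): sigma^2 / alpha + (1 - alpha) / alpha * m^2."""
    mean = sum(p * x for x, p in dist.items())
    var = sum(p * x * x for x, p in dist.items()) - mean ** 2
    return var / alpha + (1 - alpha) / alpha * mean ** 2


# Hypothetical law of X: mostly zero, sometimes positive.
toy_dist = {0: Fr(4, 5), 1: Fr(3, 20), 3: Fr(1, 20)}
toy_alpha = Fr(3, 10)
```

Here the mean is 3/10 and the variance 51/100, so the metric term contributes 17/10 and the split term 21/100, for a total of 191/100 on both sides.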

Another fact shown by Theorem 4 is that the metrics of the two populations are not
independent. This is intuitive: when a user is associated to one population and
thus included in the corresponding sum, the other population loses this user. If $m_{X^A}$ and
$m_{X^B}$ are positive, then the correlation is negative, in line with this intuition.

Finally, Theorem 4 provides the asymptotic joint distribution of the four
empirical averages we are interested in to compare the two populations. Simple linear
combinations such as $S_n^{X^B} - S_n^{X^A}$ remain asymptotically normal and confidence intervals
can easily be derived as stated in Proposition 5. This allows for comparing, for example,
the absolute increment of the number of displays per user generated by the two algorithms
A and B.


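Proposition 5 (CLT for $f(x, y, x', y') = x' - x$) Under Assumptions A1-A5, the absolute
increment satisfies the following central limit theorem

$$\sqrt{n} \left( S_n^{X^B} - S_n^{X^A} - \left(m_{X^B} - m_{X^A}\right) \right) \xrightarrow[n \to \infty]{\mathcal{D}} \mathcal{N}\left(0, \; \sigma^2_{\bar{X}^A} + \sigma^2_{\bar{X}^B} + 2 m_{X^A} m_{X^B}\right) ,$$

where $\sigma^2_{\bar{X}^A}$ and $\sigma^2_{\bar{X}^B}$ are defined respectively in (7) and (9).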


When coming to confidence intervals for relative increments such as $S_n^{X^B}/S_n^{X^A}$, or for
ratio metrics such as $S_n^{X^A}/S_n^{Y^A}$, without further steps, one would need to compute quantiles
of the ratio of two correlated normal random variables. This problem is known to be difficult
and has been discussed for decades, see Marsaglia (2006) and references therein.

However, such ratios can themselves be shown to be asymptotically normal in our setup
as stated in Propositions 6 and 7.

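Proposition 6 (CLT for $f(x, y, x', y') = x'/x$) Under Assumptions A1-A5, the ratio
satisfies the following central limit theorem

$$\sqrt{n} \left( \frac{S_n^{X^B}}{\varphi\left(S_n^{X^A}\right)} - \frac{m_{X^B}}{m_{X^A}} \right) \xrightarrow[n \to \infty]{\mathcal{D}} \mathcal{N}\left(0, \; \left(\frac{m_{X^B}}{m_{X^A}}\right)^2 \left[ \left(\frac{\sigma_{\bar{X}^A}}{m_{X^A}}\right)^2 + \left(\frac{\sigma_{\bar{X}^B}}{m_{X^B}}\right)^2 + 2 \right] \right) ,$$

where $\sigma^2_{\bar{X}^A}$ and $\sigma^2_{\bar{X}^B}$ are defined respectively in (7) and (9) and $\varphi$ in Definition 3.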


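Following similar steps, we can now derive central limit theorems for ratios of the form
$S_n^{X^A}/S_n^{Y^A}$, which allows us to get confidence intervals for metrics such as CTR as in Example 2.

Proposition 7 Under Assumptions A1-A5, the ratios $S_n^{X^A}/\varphi\left(S_n^{Y^A}\right)$ and $S_n^{X^B}/\varphi\left(S_n^{Y^B}\right)$
satisfy the following central limit theorem

$$\sqrt{n} \begin{pmatrix} \dfrac{S_n^{X^A}}{\varphi\left(S_n^{Y^A}\right)} - \dfrac{m_{X^A}}{m_{Y^A}} \\[2ex] \dfrac{S_n^{X^B}}{\varphi\left(S_n^{Y^B}\right)} - \dfrac{m_{X^B}}{m_{Y^B}} \end{pmatrix} \xrightarrow[n \to \infty]{\mathcal{D}} \mathcal{N}\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} V_A & 0 \\ 0 & V_B \end{pmatrix} \right) ,$$

with

$$V_A \overset{\mathrm{def}}{=} \left(\frac{m_{X^A}}{m_{Y^A}}\right)^2 \left[ \left(\frac{\sigma_{\bar{X}^A}}{m_{X^A}}\right)^2 + \left(\frac{\sigma_{\bar{Y}^A}}{m_{Y^A}}\right)^2 - 2 \bar{\rho}_A \frac{\sigma_{\bar{X}^A}}{m_{X^A}} \frac{\sigma_{\bar{Y}^A}}{m_{Y^A}} \right] , \tag{13}$$

$$V_B \overset{\mathrm{def}}{=} \left(\frac{m_{X^B}}{m_{Y^B}}\right)^2 \left[ \left(\frac{\sigma_{\bar{X}^B}}{m_{X^B}}\right)^2 + \left(\frac{\sigma_{\bar{Y}^B}}{m_{Y^B}}\right)^2 - 2 \bar{\rho}_B \frac{\sigma_{\bar{X}^B}}{m_{X^B}} \frac{\sigma_{\bar{Y}^B}}{m_{Y^B}} \right] , \tag{14}$$

where $\sigma^2_{\bar{X}^A}$, $\sigma^2_{\bar{Y}^A}$, $\sigma^2_{\bar{X}^B}$, $\sigma^2_{\bar{Y}^B}$, $\bar{\rho}_A$ and $\bar{\rho}_B$ are respectively defined in (7), (8), (9), (10), (11)
and (12).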


One can remark that whereas $S_n^{X^A}$ and $S_n^{X^B}$ are asymptotically correlated, as well as
$S_n^{Y^A}$ and $S_n^{Y^B}$, the ratios $S_n^{X^A}/\varphi\left(S_n^{Y^A}\right)$ and $S_n^{X^B}/\varphi\left(S_n^{Y^B}\right)$ are not. This can be explained
by recalling that the correlation of the non-ratio metrics is due to the fact that adding a
user to one sum excludes that user from the other one, resulting in a negative correlation. On
the contrary, ratios inside each population are independent of the scale of the individual
sums, and their correlation vanishes asymptotically.
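
The vanishing correlation can be verified with the delta method: the asymptotic covariance of the two ratios is the gradient of $x/y$ at the means of population A, times the cross-population covariance block of Theorem 4, times the gradient of $x'/y'$ at the means of population B. A minimal sketch, with a hypothetical function name and illustrative means:

```python
from fractions import Fraction as Fr


def ratio_cross_covariance(m_xa, m_ya, m_xb, m_yb):
    """Delta-method covariance between the ratios x/y (A) and x'/y' (B)."""
    grad_a = (1 / m_ya, -m_xa / m_ya ** 2)  # gradient of x / y
    grad_b = (1 / m_yb, -m_xb / m_yb ** 2)  # gradient of x' / y'
    # Cross-population covariances from Theorem 4.
    cross = ((-m_xa * m_xb, -m_xa * m_yb),
             (-m_ya * m_xb, -m_ya * m_yb))
    return sum(grad_a[i] * cross[i][j] * grad_b[j]
               for i in range(2) for j in range(2))


# Illustrative means only; exact arithmetic shows the four terms cancel.
value = ratio_cross_covariance(Fr(19, 100), Fr(43, 10), Fr(1, 5), Fr(4))
```

The cross block is the rank-one matrix $-m_A m_B^{\top}$, and the gradient of a ratio is orthogonal to the vector of means at which it is evaluated, so the product is zero for any positive means.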

We can now derive central limit theorems for both the absolute and relative differences 
of ratios. This is done in Propositions 8 and 9 respectively. 

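Proposition 8 (CLT for $f(x, y, x', y') = x'/y' - x/y$) Under Assumptions A1-A5, the ratios
$S_n^{X^A}/\varphi\left(S_n^{Y^A}\right)$ and $S_n^{X^B}/\varphi\left(S_n^{Y^B}\right)$ satisfy the following central limit theorem

$$\sqrt{n} \left( \frac{S_n^{X^B}}{\varphi\left(S_n^{Y^B}\right)} - \frac{S_n^{X^A}}{\varphi\left(S_n^{Y^A}\right)} - \left( \frac{m_{X^B}}{m_{Y^B}} - \frac{m_{X^A}}{m_{Y^A}} \right) \right) \xrightarrow[n \to \infty]{\mathcal{D}} \mathcal{N}\left(0, V_A + V_B\right) ,$$

where $V_A$ and $V_B$ are defined respectively in (13) and (14).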

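Proposition 9 (CLT for $f(x, y, x', y') = (x'/y')/(x/y)$) Under Assumptions A1-A5, we have the
following central limit theorem

$$\sqrt{n} \left( \frac{S_n^{X^B}/\varphi\left(S_n^{Y^B}\right)}{\varphi\left(S_n^{X^A}\right)/\varphi\left(S_n^{Y^A}\right)} - \frac{m_{X^B}/m_{Y^B}}{m_{X^A}/m_{Y^A}} \right) \xrightarrow[n \to \infty]{\mathcal{D}} \mathcal{N}\left(0, \; \left(\frac{m_{X^B}/m_{Y^B}}{m_{X^A}/m_{Y^A}}\right)^2 \left[ \frac{V_A}{\left(m_{X^A}/m_{Y^A}\right)^2} + \frac{V_B}{\left(m_{X^B}/m_{Y^B}\right)^2} \right] \right) ,$$

where $V_A$ and $V_B$ are defined in Proposition 7.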
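3.3 Variance Estimation

Algorithm 3 defined in Section 3.1 relies on estimators of the asymptotic variance given in
Propositions 5, 6, 8, and 9. All the related variances are given as a continuous function
of $m_{X^A}$, $m_{Y^A}$, $m_{X^B}$, $m_{Y^B}$ defined in (1), and $\sigma^2_{\bar{X}^A}$, $\sigma^2_{\bar{Y}^A}$, $\sigma^2_{\bar{X}^B}$, $\sigma^2_{\bar{Y}^B}$, $\bar{\rho}_A$, $\bar{\rho}_B$ respectively
defined in (7), (8), (9), (10), (11) and (12). According to the continuous mapping theorem,
we thus only need to get consistent estimators $\widehat{m}_{X^A}[n]$, $\widehat{m}_{Y^A}[n]$, $\widehat{m}_{X^B}[n]$, $\widehat{m}_{Y^B}[n]$, $\widehat{\sigma}^2_{\bar{X}^A}[n]$,
$\widehat{\sigma}^2_{\bar{Y}^A}[n]$, $\widehat{\sigma}^2_{\bar{X}^B}[n]$, $\widehat{\sigma}^2_{\bar{Y}^B}[n]$, $\widehat{\rho}_A[n]$, $\widehat{\rho}_B[n]$ of these ten quantities to derive consistent estimators
of the asymptotic variances stated in Propositions 5, 6, 8, and 9.

The mean estimators are easily obtained from $S_n^{X^A}$, $S_n^{Y^A}$, $S_n^{X^B}$ and $S_n^{Y^B}$:

$$\widehat{m}_{X^A}[n] \overset{\mathrm{def}}{=} S_n^{X^A}, \quad \widehat{m}_{Y^A}[n] \overset{\mathrm{def}}{=} S_n^{Y^A}, \quad \widehat{m}_{X^B}[n] \overset{\mathrm{def}}{=} S_n^{X^B}, \quad \widehat{m}_{Y^B}[n] \overset{\mathrm{def}}{=} S_n^{Y^B} .$$

The variance estimators can be computed directly from the random variables $\left(\bar{X}_i^A, \bar{Y}_i^A, \bar{X}_i^B, \bar{Y}_i^B\right)_{i \ge 1}$
without estimating in a first step $\sigma_{X^A}$, $\sigma_{Y^A}$, $\sigma_{X^B}$, $\sigma_{Y^B}$:

$$\widehat{\sigma}^2_{\bar{X}^A}[n] \overset{\mathrm{def}}{=} \frac{1}{n - 1} \sum_{i=1}^{n} \left(\bar{X}_i^A - S_n^{X^A}\right)^2, \quad \widehat{\sigma}^2_{\bar{Y}^A}[n] \overset{\mathrm{def}}{=} \frac{1}{n - 1} \sum_{i=1}^{n} \left(\bar{Y}_i^A - S_n^{Y^A}\right)^2 ,$$

$$\widehat{\sigma}^2_{\bar{X}^B}[n] \overset{\mathrm{def}}{=} \frac{1}{n - 1} \sum_{i=1}^{n} \left(\bar{X}_i^B - S_n^{X^B}\right)^2, \quad \widehat{\sigma}^2_{\bar{Y}^B}[n] \overset{\mathrm{def}}{=} \frac{1}{n - 1} \sum_{i=1}^{n} \left(\bar{Y}_i^B - S_n^{Y^B}\right)^2 .$$

Finally, the correlation estimators are obtained in a similar way:

$$\widehat{\rho}_A[n] \overset{\mathrm{def}}{=} \frac{\frac{1}{n - 1} \sum_{i=1}^{n} \left(\bar{X}_i^A - S_n^{X^A}\right) \left(\bar{Y}_i^A - S_n^{Y^A}\right)}{\widehat{\sigma}_{\bar{X}^A}[n] \, \widehat{\sigma}_{\bar{Y}^A}[n]} , \quad \widehat{\rho}_B[n] \overset{\mathrm{def}}{=} \frac{\frac{1}{n - 1} \sum_{i=1}^{n} \left(\bar{X}_i^B - S_n^{X^B}\right) \left(\bar{Y}_i^B - S_n^{Y^B}\right)}{\widehat{\sigma}_{\bar{X}^B}[n] \, \widehat{\sigma}_{\bar{Y}^B}[n]} .$$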
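
Putting the estimators above together with Algorithm 3 and Proposition 8 gives a confidence interval for the absolute CTR increment. The sketch below assumes the data is already grouped by user, and it writes $V_A$ and $V_B$ in an expanded form that avoids dividing by the mean number of clicks; the function name is hypothetical.

```python
import math
from statistics import NormalDist


def ctr_difference_ci(users, alpha_a, alpha_b, level=0.95):
    """CLT interval for CTR_B - CTR_A from per-user (population, clicks, displays)."""
    n = len(users)
    # Scaled, zero-filled variables of Definition 1 for both populations.
    cols = {"A": ([], []), "B": ([], [])}
    for pop, x, y in users:
        for key, alpha in (("A", alpha_a), ("B", alpha_b)):
            inside = 1.0 if pop == key else 0.0
            cols[key][0].append(inside * x / alpha)
            cols[key][1].append(inside * y / alpha)

    def ratio_and_variance(xs, ys):
        mx, my = sum(xs) / n, sum(ys) / n
        vx = sum((v - mx) ** 2 for v in xs) / (n - 1)
        vy = sum((v - my) ** 2 for v in ys) / (n - 1)
        cxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / (n - 1)
        # (13)/(14) expanded; cxy plays the role of rho * sigma_x * sigma_y.
        v = vx / my ** 2 + mx ** 2 * vy / my ** 4 - 2 * mx * cxy / my ** 3
        return mx / my, v

    ctr_a, v_a = ratio_and_variance(*cols["A"])
    ctr_b, v_b = ratio_and_variance(*cols["B"])
    s = NormalDist().inv_cdf((1 + level) / 2)
    half_width = s * math.sqrt((v_a + v_b) / n)
    diff = ctr_b - ctr_a
    return diff - half_width, diff + half_width
```

The division by $n$ inside the square root reflects the $\sqrt{n}$ normalisation of the central limit theorem, with $n$ the total number of users across both populations.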


4. Numerical Application to CTR Confidence Intervals 

We use a real dataset described in Section 4.1 to numerically demonstrate the proposed
algorithms. Blank AB-tests, in which both populations receive the same system, are simulated
over this dataset to validate the user independence
assumption in Section 4.2 and to compare the bootstrap algorithms in Section 4.3. Blank
AB-tests are of particular interest here since we know that, whatever the metric of interest,
its increment should be 0. This makes it easy to check whether a given confidence interval
contains the true value it aims to estimate.
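
A blank AB-test can be simulated by randomly assigning users, not dataset lines, to the two populations. The following minimal sketch shows one way to do this; the function name, the seed handling and the sorting of ids for reproducibility are assumptions of this illustration.

```python
import random


def blank_ab_split(user_ids, alpha_a=0.5, seed=0):
    """Assign each distinct user to A with probability alpha_a, else to B.

    The split is drawn per user, so all lines of a user share one population.
    """
    rng = random.Random(seed)
    return {uid: ("A" if rng.random() < alpha_a else "B")
            for uid in sorted(set(user_ids))}
```

Repeating the split with different seeds gives independent blank AB-tests over the same dataset.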


4.1 Dataset Description 

The dataset used in this paper is publicly accessible from the KDD Cup website KDD
(2012). It has been built out of search session log messages containing one line per search.
Each line provides the user id, the number of displays and the number of clicks associated
to the current search session. Other information is available in the dataset but is not
relevant for this study. The lines are not grouped by user and the same user can be found
in different and separate search sessions. Due to the large number of simulations run in
this section, we kept only the first 1 million users out of 22 million, sorted by lexicographic
order on the user id. An extract of the dataset is shown in Table 2 and some statistics
are available in Table 3. Furthermore, the distribution of the number of clicks per user
(knowing the user has clicked at least once) is displayed in Figure 1. It illustrates the fact
that this number of clicks cannot be approximated by a Bernoulli law.

Figure 1: Number of clicks per user, for users with at least one click


Userid      NbDisplays   NbClicks
10000244    1            0
10000148    3            1
10000089    1            0
1000026     6            0
1000002     1            0
1000002     1            0
10000315    1            0
10000925    3            2
10000185    1            0

Table 2: Dataset sample
In the next sections, blank AB-tests will be simulated from this dataset to compare the
CTR (click-through rate) of each population, defined as the average number of clicks per
display:

$$\mathrm{CTR} \overset{\mathrm{def}}{=} \frac{\mathrm{NbClicks}}{\mathrm{NbDisplays}} .$$

Number of users      1000000    CTR                 4.4%
Number of displays   4332627    Displays per user   4.3
Number of clicks     191892     Clicks per user     0.19

Table 3: Dataset statistics


4.2 User Independence Assumption

In order to validate Assumption A1 and to show that it cannot be approximated by an
independence of the displays, 500 blank AB-tests were simulated (see footnote 1). For each AB-test,
confidence intervals at different levels (from 50% to 99%) were computed for the absolute CTR
increment $\mathrm{CTR}_B - \mathrm{CTR}_A$ using two methods. The first one assumes that the displays are
independent, implying an asymptotic variance of

$$\frac{\mathrm{CTR}_A \left(1 - \mathrm{CTR}_A\right)}{\mathrm{NbDisplays}_A} + \frac{\mathrm{CTR}_B \left(1 - \mathrm{CTR}_B\right)}{\mathrm{NbDisplays}_B} .$$

This is the formula usually given when describing AB-test analysis. The second method
assumes that the users are independent and is described in Algorithm 3. If the variables
$\left(X_i^A, Y_i^A, X_i^B, Y_i^B\right)_{i \ge 1}$ model the following quantities

• $X_i^A$: number of clicks from user $i$ if system A is applied,

• $Y_i^A$: number of displays shown to user $i$ if system A is applied,

• $X_i^B$: number of clicks from user $i$ if system B is applied,

• $Y_i^B$: number of displays shown to user $i$ if system B is applied,

then the CTR of each population can be written

$$\mathrm{CTR}_A = \frac{S_n^{X^A}}{S_n^{Y^A}} , \quad \mathrm{CTR}_B = \frac{S_n^{X^B}}{S_n^{Y^B}} ,$$

and the asymptotic variance of $\mathrm{CTR}_B - \mathrm{CTR}_A$ is given in Proposition 8.
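
For comparison with the user-level method, the first method (independent displays) only needs the aggregated counts of each population. A minimal sketch, with a hypothetical function name; the example counts are simply the totals of Table 3 split in half for illustration and are not results from the paper.

```python
import math
from statistics import NormalDist


def naive_ctr_difference_ci(clicks_a, displays_a, clicks_b, displays_b, level=0.95):
    """Interval for CTR_B - CTR_A assuming every display is an independent Bernoulli trial."""
    ctr_a = clicks_a / displays_a
    ctr_b = clicks_b / displays_b
    variance = (ctr_a * (1 - ctr_a) / displays_a
                + ctr_b * (1 - ctr_b) / displays_b)
    s = NormalDist().inv_cdf((1 + level) / 2)
    half_width = s * math.sqrt(variance)
    diff = ctr_b - ctr_a
    return diff - half_width, diff + half_width


# Illustrative counts only: half of the dataset totals in each population.
low, high = naive_ctr_difference_ci(95946, 2166313, 95946, 2166313)
```

This variance ignores that displays shown to the same user are correlated, which is the effect the experiment in this section is designed to expose.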

The true value of the absolute increment is known to be 0 and, for each confidence level,
we give the percentage of AB-tests for which the confidence interval contained 0. The closer
this percentage is to the target confidence level, the better the underlying method. Results
for both assumptions are shown in Figure 2. Assuming independence of displays leads to
under-estimating the AB-test noise, and increments appear significant much more often
than they should. For example, a 95%-confidence interval includes the true value in only
59% of AB-tests, which contradicts the definition of a confidence interval. On the contrary,
the assumption of user independence leads to the expected conclusion of having almost 95%
of 95%-confidence intervals including 0, and this remains true for all other tested levels.

This under-estimation is explicitly illustrated in Figure 3 where the empirical distribution
of the number of clicks (obtained by bootstrapping) is compared to the binomial
distribution implied by the display independence assumption. It shows that the empirical
standard deviation is much higher than the binomial one (twice as big in this example).

Figure 2: Display vs user independence assumption

Figure 3: Empirical vs binomial number of clicks distribution

1. Experiments have also been made for 300 and 400 blank AB-tests and the results were very similar.

4.3 Comparison of Bootstrap Algorithms 

The assumption of independence by user having been validated, we can now focus on the
comparison of the proposed algorithms. The method using only the central limit theorem
is given as a reference but is not of practical interest here as the dataset is not grouped
by user (see Section 3.1). We are thus more interested in comparing Algorithms 2 and 4, as
they can be implemented in such a way that the dataset is read only once. Each algorithm
uses bootstrapping, with a computational cost linear in the number of bootstraps M.

Figure 4: Bootstrap algorithms’ performance with M = 10 (x-axis: desired confidence level, from 50% to 100%; series: Theory, CLT, CLT+Boot(M=10), Boot(M=10))

Similarly to Section 4.2, 500 blank AB-tests were simulated from the dataset described in
Section 4.1 to compute confidence intervals for the CTR relative increment

$$\frac{\mathrm{CTR}_B}{\mathrm{CTR}_A} - 1 = \frac{S_n^{X^B} / S_n^{Y^B}}{S_n^{X^A} / S_n^{Y^A}} - 1 ,$$

where $(X_i^A, Y_i^A, X_i^B, Y_i^B)_{i \geq 1}$ are defined in Section 4.2. According to Proposition 9, this
estimator is asymptotically normal and its mean should be 0 for a blank AB-test. The
frequency of confidence intervals including the true value 0 is displayed in Figure 4 for
different levels of confidence and for both the pure bootstrap technique with M = 10
(Algorithm 2) and the technique using the bootstrap variance in the CLT (Algorithm 4),
again with M = 10. As expected, for a small number of bootstraps M = 10, the pure
bootstrap algorithm performs poorly and yields acceptable confidence intervals
for only a few confidence levels, while the algorithm using both CLT and bootstrapping
shows good results for all confidence levels at the same computational cost. In Figure 5,
we show the influence of the number of bootstraps M on the ability of each algorithm to
compute reliable 95% confidence intervals. The pure bootstrap algorithm converges more
slowly to the target 95% value and requires twice the computational cost of the mixed
algorithm.
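The two ways of turning M bootstrap replicates into an interval can be sketched as follows. This is a simplified illustration under stated assumptions, not a transcription of Algorithms 2 and 4: the data are synthetic, users are resampled with replacement in memory (rather than in a single pass over the dataset), the pure bootstrap interval is taken to be a percentile interval, and all function names are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def relative_ctr_increment(clicks, displays, in_b):
    """CTR_B / CTR_A - 1 from per-user clicks/displays and a boolean B mask."""
    ctr_a = clicks[~in_b].sum() / displays[~in_b].sum()
    ctr_b = clicks[in_b].sum() / displays[in_b].sum()
    return ctr_b / ctr_a - 1.0

def bootstrap_estimates(clicks, displays, in_b, n_boot, rng):
    """Recompute the estimator on n_boot resamples of the users."""
    n = len(clicks)
    out = np.empty(n_boot)
    for m in range(n_boot):
        idx = rng.integers(0, n, size=n)
        out[m] = relative_ctr_increment(clicks[idx], displays[idx], in_b[idx])
    return out

def pure_bootstrap_ci(boot, level):
    """Percentile interval read directly from the bootstrap estimates."""
    tail = (1.0 - level) / 2.0
    return np.quantile(boot, tail), np.quantile(boot, 1.0 - tail)

def clt_bootstrap_ci(estimate, boot, level):
    """Normal interval whose variance is the bootstrap variance."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * boot.std(ddof=1)
    return estimate - half_width, estimate + half_width

# Synthetic blank AB-test (assumption): the same system on both sides.
rng = np.random.default_rng(1)
n_users = 5_000
displays = rng.poisson(20, size=n_users) + 1
clicks = rng.binomial(displays, rng.beta(0.5, 9.5, size=n_users))
in_b = rng.random(n_users) < 0.5

estimate = relative_ctr_increment(clicks, displays, in_b)
boot = bootstrap_estimates(clicks, displays, in_b, n_boot=10, rng=rng)
print("pure bootstrap :", pure_bootstrap_ci(boot, 0.95))
print("CLT + bootstrap:", clt_bootstrap_ci(estimate, boot, 0.95))
```

With only 10 replicates, the percentile interval is bounded by the smallest and largest replicate, whereas the normal interval only needs a variance estimate, which is one way to read the behaviour reported for M = 10.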
Figure 5: Bootstrap algorithms’ performance for different values of M

5. Conclusion

We have translated the AB-test process into a statistical framework, providing three
algorithms for the computation of confidence intervals. Each of them is suited to a
different practical case:

1. if the number of users n is small, pure bootstrapping is the best choice (see Algorithm 
2), and a large number of bootstraps M is tractable; 

2. if the number of users n is large and the dataset is grouped by user, then one should 
use one of the relevant central limit theorems (see Algorithm 3); 

3. if the number of users n is large and the dataset is not grouped by user, the algorithm 
using the bootstrap variance in the central limit theorem will result in the smallest 
computational cost (see Algorithm 4). 

Numerical experiments allowed us to check that our assumptions were valid. We focused
on the CTR computation, but, as stated in the theoretical parts, the proposed algorithms
apply to any metric that can be written as a sum or a ratio of sums, e.g., to the sales
amount spent per user as well as the revenue generated per user. Similar numerical results
on such metrics allowed us to validate the algorithms.

It is worth noting that the provided algorithms lead to results that are valid only during the
AB-test and do not extend to the future. This is known as the long-term effect, as discussed
in Kohavi et al. (2009). Addressing this issue would require additional assumptions on the
metrics of interest, such as time series modeling, and is out of the scope of this paper.

Appendix A. Convergence Results

The notations used here are independent from the ones defined in the other sections as the
following propositions are general results on random variable convergence. We only keep
the definition of $\varphi$ given in Definition 3 that is widely used when dealing with ratios.

All the random variables will be assumed to be defined on a probability space $(\Omega, \mathcal{F}, \mathbb{P})$
and the expectation operator under $\mathbb{P}$ will be denoted by $\mathbb{E}[\cdot]$.

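Lemma 10 Let $(Y_n)_{n \geq 1}$ be a sequence of real random variables converging in probability to
a real constant $y \neq 0$. Then the sequence $(\varphi(Y_n))_{n \geq 1}$ also converges in probability to $y$,
where $\varphi$ is defined in Definition 3.

Proof By the triangle inequality, we have, for each $n \geq 1$,

$$|\varphi(Y_n) - y| \leq |\varphi(Y_n) - Y_n| + |Y_n - y| = \mathbb{1}_{Y_n = 0} + |Y_n - y| ,$$

implying that for each $\epsilon > 0$

$$\mathbb{P}\{|\varphi(Y_n) - y| > \epsilon\} \leq \mathbb{P}\{\mathbb{1}_{Y_n = 0} > \epsilon/2\} + \mathbb{P}\{|Y_n - y| > \epsilon/2\} ,$$

where the second probability converges to 0 by definition of $Y_n \xrightarrow{\mathbb{P}} y$ and the first one is
bounded by

$$\mathbb{P}\{\mathbb{1}_{Y_n = 0} > \epsilon/2\} \leq \mathbb{P}\{Y_n = 0\} \leq \mathbb{P}\{|Y_n - y| > |y|/2\} ,$$

where the last probability converges to 0 by definition of $Y_n \xrightarrow{\mathbb{P}} y$.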



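Lemma 11 Let $(X_n^1, \cdots, X_n^d, Y_n)_{n \geq 1}$ be a sequence of random variables in $\mathbb{R}^{d+1}$ such that

1. $Y_n \xrightarrow{\mathbb{P}} y$ where $y$ is a real constant such that $y \neq 0$,

2. There exists $c \in [0, 1)$ such that $\mathbb{P}\{Y_n = 0\} \leq c^n$,

3. There exist $(x_1, \cdots, x_d) \in \mathbb{R}^d$ and a random variable $V$ in $\mathbb{R}^{d+1}$ such that

$$\sqrt{n}\,(X_n^1 - x_1, \cdots, X_n^d - x_d, Y_n - y) \xrightarrow{\mathcal{D}} V .$$

Then Assertions 1 and 3 are also satisfied with $\varphi(Y_n)$ in place of $Y_n$, where $\varphi$ is defined in Definition 3.

Proof Assumption 1 and Lemma 10 directly give $\varphi(Y_n) \xrightarrow{\mathbb{P}} y$.

In order to prove the convergence in distribution, we use the portmanteau lemma by
showing that for every bounded Lipschitz function $f$, $\mathbb{E}[f(\sqrt{n}(X_n^1 - x_1, \cdots, X_n^d - x_d, \varphi(Y_n) - y))]$
converges to $\mathbb{E}[f(V)]$. Let $f$ be a bounded Lipschitz function; we have

$$\begin{aligned}
&\left| \mathbb{E}\left[f\left(\sqrt{n}(X_n^1 - x_1, \cdots, X_n^d - x_d, \varphi(Y_n) - y)\right)\right] - \mathbb{E}[f(V)] \right| \\
&\quad \leq \mathbb{E}\left| f\left(\sqrt{n}(X_n^1 - x_1, \cdots, X_n^d - x_d, \varphi(Y_n) - y)\right) - f\left(\sqrt{n}(X_n^1 - x_1, \cdots, X_n^d - x_d, Y_n - y)\right) \right| \\
&\qquad + \left| \mathbb{E}\left[f\left(\sqrt{n}(X_n^1 - x_1, \cdots, X_n^d - x_d, Y_n - y)\right)\right] - \mathbb{E}[f(V)] \right| . \qquad (15)
\end{aligned}$$

According to Assumption 3, the second term of the right hand side of (15) converges to 0.
The first term is handled using the Lipschitz property of $f$: there exists a constant $K$ such
that for all $(a, b) \in (\mathbb{R}^{d+1})^2$, $|f(a) - f(b)| \leq K \|a - b\|_1$, so that

$$\begin{aligned}
&\mathbb{E}\left| f\left(\sqrt{n}(X_n^1 - x_1, \cdots, X_n^d - x_d, \varphi(Y_n) - y)\right) - f\left(\sqrt{n}(X_n^1 - x_1, \cdots, X_n^d - x_d, Y_n - y)\right) \right| \\
&\quad \leq K \sqrt{n}\, \mathbb{E}\left\| (X_n^1 - x_1, \cdots, X_n^d - x_d, \varphi(Y_n) - y) - (X_n^1 - x_1, \cdots, X_n^d - x_d, Y_n - y) \right\|_1 \\
&\quad = K \sqrt{n}\, \mathbb{E}\left[|\varphi(Y_n) - Y_n|\right] = K \sqrt{n}\, \mathbb{E}\left[\mathbb{1}_{Y_n = 0}\right] = K \sqrt{n}\, \mathbb{P}\{Y_n = 0\} , \\
&\quad \leq K \sqrt{n}\, c^n , \quad \text{according to Assumption 2,}
\end{aligned}$$

which shows that the first term of the right hand side of (15) converges to 0 and that

$$\sqrt{n}\,(X_n^1 - x_1, \cdots, X_n^d - x_d, \varphi(Y_n) - y) \xrightarrow{\mathcal{D}} V .$$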


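Proposition 12 Let $(X_n, Y_n, X'_n, Y'_n)_{n \geq 1}$ be a sequence of random variables in $\mathbb{R}^4$, $(x, y, x', y') \in \mathbb{R}^4$
and $\Sigma$ a $4 \times 4$ covariance matrix such that

1. $y \neq 0$ and $y' \neq 0$,

2. $Y_n \xrightarrow{\mathbb{P}} y$ and $Y'_n \xrightarrow{\mathbb{P}} y'$,

3. There exists $c \in [0, 1)$ such that $\mathbb{P}\{Y_n = 0\} \leq c^n$ and $\mathbb{P}\{Y'_n = 0\} \leq c^n$,

4. The sequence $(X_n, Y_n, X'_n, Y'_n)_{n \geq 1}$ satisfies the following central limit theorem

$$\sqrt{n} \begin{pmatrix} X_n - x \\ Y_n - y \\ X'_n - x' \\ Y'_n - y' \end{pmatrix} \xrightarrow{\mathcal{D}} \mathcal{N}\left( \begin{pmatrix} 0 \\ 0 \\ 0 \\ 0 \end{pmatrix}, \Sigma \right) .$$

Then the ratio sequence $(X_n / \varphi(Y_n), X'_n / \varphi(Y'_n))_{n \geq 1}$ satisfies the following central limit
theorem

$$\sqrt{n} \begin{pmatrix} \dfrac{X_n}{\varphi(Y_n)} - \dfrac{x}{y} \\ \dfrac{X'_n}{\varphi(Y'_n)} - \dfrac{x'}{y'} \end{pmatrix} \xrightarrow{\mathcal{D}} \mathcal{N}\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, P^T \Sigma P \right) , \quad \text{where } P = \begin{pmatrix} \dfrac{1}{y} & 0 \\ -\dfrac{x}{y^2} & 0 \\ 0 & \dfrac{1}{y'} \\ 0 & -\dfrac{x'}{(y')^2} \end{pmatrix} .$$

Proof We first rewrite

$$\frac{X_n}{\varphi(Y_n)} - \frac{x}{y} = \frac{X_n - x}{\varphi(Y_n)} + \frac{x}{\varphi(Y_n)} - \frac{x}{y} = \frac{1}{\varphi(Y_n)} (X_n - x) - \frac{x}{\varphi(Y_n)\, y} (\varphi(Y_n) - y) ,$$

and similarly

$$\frac{X'_n}{\varphi(Y'_n)} - \frac{x'}{y'} = \frac{1}{\varphi(Y'_n)} (X'_n - x') - \frac{x'}{\varphi(Y'_n)\, y'} (\varphi(Y'_n) - y') ,$$

so that

$$\begin{pmatrix} \dfrac{X_n}{\varphi(Y_n)} - \dfrac{x}{y} \\ \dfrac{X'_n}{\varphi(Y'_n)} - \dfrac{x'}{y'} \end{pmatrix} = P_n^T \begin{pmatrix} X_n - x \\ \varphi(Y_n) - y \\ X'_n - x' \\ \varphi(Y'_n) - y' \end{pmatrix} , \quad \text{where } P_n = \begin{pmatrix} \dfrac{1}{\varphi(Y_n)} & 0 \\ -\dfrac{x}{\varphi(Y_n)\, y} & 0 \\ 0 & \dfrac{1}{\varphi(Y'_n)} \\ 0 & -\dfrac{x'}{\varphi(Y'_n)\, y'} \end{pmatrix} .$$

By applying Lemma 10, $\varphi(Y_n) \xrightarrow{\mathbb{P}} y$ and $\varphi(Y'_n) \xrightarrow{\mathbb{P}} y'$ so that $P_n \xrightarrow{\mathbb{P}} P$. Furthermore,
using Lemma 11 twice, we successively get that $(X_n, \varphi(Y_n), X'_n, Y'_n)_{n \geq 1}$ and then
$(X_n, \varphi(Y_n), X'_n, \varphi(Y'_n))_{n \geq 1}$ satisfy the CLT stated in Assumption 4. We then only need to
apply the Slutsky lemma to conclude. ■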
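To make the limiting covariance of Proposition 12 concrete, the product $P^T \Sigma P$ can be expanded entry by entry. The indexing $\Sigma = (\sigma_{ij})_{1 \leq i, j \leq 4}$ used below is introduced here for illustration only, with rows and columns ordered as $(X_n, Y_n, X'_n, Y'_n)$, and $P$ is taken to be the $4 \times 2$ matrix with columns $(1/y, -x/y^2, 0, 0)^T$ and $(0, 0, 1/y', -x'/(y')^2)^T$.

```latex
\begin{align*}
(P^T \Sigma P)_{11} &= \frac{\sigma_{11}}{y^2} - \frac{2x\,\sigma_{12}}{y^3} + \frac{x^2\,\sigma_{22}}{y^4} , \\
(P^T \Sigma P)_{22} &= \frac{\sigma_{33}}{(y')^2} - \frac{2x'\,\sigma_{34}}{(y')^3} + \frac{(x')^2\,\sigma_{44}}{(y')^4} , \\
(P^T \Sigma P)_{12} &= \frac{\sigma_{13}}{y\,y'} - \frac{x'\,\sigma_{14}}{y\,(y')^2} - \frac{x\,\sigma_{23}}{y^2\,y'} + \frac{x\,x'\,\sigma_{24}}{y^2\,(y')^2} .
\end{align*}
```

The first two lines are the usual delta-method variances of a ratio, and the third is the covariance between the two ratios.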


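Corollary 13 Let $(X_n, Y_n)_{n \geq 1}$ be a sequence of random variables in $\mathbb{R}^2$, $(x, y) \in \mathbb{R}^2$ and $\Sigma$
a $2 \times 2$ covariance matrix such that

1. $y \neq 0$,

2. $Y_n \xrightarrow{\mathbb{P}} y$,

3. There exists $c \in [0, 1)$ such that $\mathbb{P}\{Y_n = 0\} \leq c^n$,

4. $(X_n, Y_n)_{n \geq 1}$ satisfies the following central limit theorem

$$\sqrt{n} \begin{pmatrix} X_n - x \\ Y_n - y \end{pmatrix} \xrightarrow{\mathcal{D}} \mathcal{N}(0, \Sigma) .$$

Then the ratio sequence $(X_n / \varphi(Y_n))_{n \geq 1}$ satisfies the following central limit theorem

$$\sqrt{n} \left( \frac{X_n}{\varphi(Y_n)} - \frac{x}{y} \right) \xrightarrow{\mathcal{D}} \mathcal{N}(0, P^T \Sigma P) , \quad \text{where } P \stackrel{\text{def}}{=} \begin{pmatrix} \dfrac{1}{y} \\ -\dfrac{x}{y^2} \end{pmatrix} .$$

Proof This is a direct consequence of Proposition 12 by keeping only the first marginal of
the ratio couple. ■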
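The asymptotic variance of Corollary 13 can be checked numerically with the minimal sketch below. It assumes that $\varphi$ of Definition 3 maps 0 to 1 and leaves other values unchanged (which is consistent with $|\varphi(Y_n) - Y_n| = \mathbb{1}_{Y_n = 0}$ in the proofs), and it uses hypothetical Gaussian data with arbitrary mean and covariance; the function names are illustrative.

```python
import numpy as np

def phi(y):
    """Assumed form of Definition 3: phi(y) = y if y != 0, else 1."""
    return y if y != 0 else 1.0

def ratio_asymptotic_variance(x, y, sigma):
    """Return P^T Sigma P with P = (1/y, -x/y^2)^T for a 2x2 covariance sigma."""
    p = np.array([1.0 / y, -x / y**2])
    return float(p @ sigma @ p)

rng = np.random.default_rng(2)
n, n_rep = 2_000, 4_000
mean = np.array([1.0, 4.0])                 # (x, y), hypothetical values
sigma = np.array([[1.0, 0.6], [0.6, 2.0]])  # hypothetical covariance of (X_i, Y_i)

# Monte Carlo variance of sqrt(n) * (X_n / phi(Y_n) - x / y).
# Shortcut: the empirical mean of n Gaussian vectors is N(mean, sigma / n).
means = rng.multivariate_normal(mean, sigma / n, size=n_rep)
ratios = np.array([xn / phi(yn) for xn, yn in means])
empirical = n * ratios.var(ddof=1)
theoretical = ratio_asymptotic_variance(mean[0], mean[1], sigma)

print(f"empirical  : {empirical:.4f}")
print(f"theoretical: {theoretical:.4f}")
```

In this setting the two values should agree up to Monte Carlo error, which illustrates how $P^T \Sigma P$ can be plugged into a normal confidence interval for a ratio.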


Appendix B. Proofs of Central Limit Theorems 


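Proof [Proof of Theorem 4] The vector $(S_n^{X^A}, S_n^{Y^A}, S_n^{X^B}, S_n^{Y^B})$ is made of empirical means of
random variables $\left( \frac{\varepsilon_i^A X_i^A}{\alpha_A}, \frac{\varepsilon_i^A Y_i^A}{\alpha_A}, \frac{\varepsilon_i^B X_i^B}{\alpha_B}, \frac{\varepsilon_i^B Y_i^B}{\alpha_B} \right)_{i \geq 1}$. According to Definition 1 and to Assumption A1,
these variables are i.i.d. and by the same definition they are centered on $(m_{XA}, m_{YA}, m_{XB}, m_{YB})$.
Furthermore, one directly sees that

$$\left| \frac{\varepsilon_1^A X_1^A}{\alpha_A} \right| \leq \frac{1}{\alpha_A} |X_1^A| , \quad \left| \frac{\varepsilon_1^A Y_1^A}{\alpha_A} \right| \leq \frac{1}{\alpha_A} |Y_1^A| , \quad \left| \frac{\varepsilon_1^B X_1^B}{\alpha_B} \right| \leq \frac{1}{\alpha_B} |X_1^B| , \quad \left| \frac{\varepsilon_1^B Y_1^B}{\alpha_B} \right| \leq \frac{1}{\alpha_B} |Y_1^B| ,$$

which, combined with Assumption A3, shows that these four variables are $L^2$-integrable. We
can then apply a multi-dimensional version of the central limit theorem to get the announced
convergence in distribution result.

It now only remains to calculate the related variances and covariances. By Definition 1,
we have

$$\begin{aligned}
\mathrm{Var}\left( \frac{\varepsilon_1^A X_1^A}{\alpha_A} \right) &= \frac{1}{\alpha_A^2} \left\{ \mathbb{E}\left[ (\varepsilon_1^A)^2 (X_1^A)^2 \right] - \mathbb{E}\left[ \varepsilon_1^A X_1^A \right]^2 \right\} , \\
&= \frac{1}{\alpha_A^2} \left\{ \mathbb{E}\left[ \varepsilon_1^A \right] \mathbb{E}\left[ (X_1^A)^2 \right] - \mathbb{E}\left[ \varepsilon_1^A \right]^2 \mathbb{E}\left[ X_1^A \right]^2 \right\} , \quad \text{by Assumption A2,} \\
&= \frac{1}{\alpha_A} \mathbb{E}\left[ (X_1^A)^2 \right] - m_{XA}^2 , \\
&= \frac{1}{\alpha_A} \left[ \sigma_{XA}^2 + m_{XA}^2 \right] - m_{XA}^2 , \quad \text{according to Assumption A3,} \\
&= \frac{\sigma_{XA}^2}{\alpha_A} + \frac{1 - \alpha_A}{\alpha_A} m_{XA}^2 .
\end{aligned}$$

The same stands for $\mathrm{Var}\left( \varepsilon_1^A Y_1^A / \alpha_A \right)$, $\mathrm{Var}\left( \varepsilon_1^B X_1^B / \alpha_B \right)$, and $\mathrm{Var}\left( \varepsilon_1^B Y_1^B / \alpha_B \right)$, and very similar steps allow
us to get the values of $\mathrm{Cov}\left( \varepsilon_1^A X_1^A / \alpha_A, \varepsilon_1^A Y_1^A / \alpha_A \right)$ and $\mathrm{Cov}\left( \varepsilon_1^B X_1^B / \alpha_B, \varepsilon_1^B Y_1^B / \alpha_B \right)$.

Using again Definition 1 and Assumption A5, one gets

$$\begin{aligned}
\mathrm{Cov}\left( \frac{\varepsilon_1^A X_1^A}{\alpha_A}, \frac{\varepsilon_1^B X_1^B}{\alpha_B} \right) &= \mathbb{E}\left[ \frac{\varepsilon_1^A \varepsilon_1^B X_1^A X_1^B}{\alpha_A \alpha_B} \right] - \mathbb{E}\left[ \frac{\varepsilon_1^A X_1^A}{\alpha_A} \right] \mathbb{E}\left[ \frac{\varepsilon_1^B X_1^B}{\alpha_B} \right] , \\
&= - m_{XA} m_{XB} , \quad \text{as } \varepsilon_1^A \varepsilon_1^B = 0 ,
\end{aligned}$$

and the same formula can be derived for $\mathrm{Cov}\left( \varepsilon_1^A X_1^A / \alpha_A, \varepsilon_1^B Y_1^B / \alpha_B \right)$, $\mathrm{Cov}\left( \varepsilon_1^A Y_1^A / \alpha_A, \varepsilon_1^B X_1^B / \alpha_B \right)$ and $\mathrm{Cov}\left( \varepsilon_1^A Y_1^A / \alpha_A, \varepsilon_1^B Y_1^B / \alpha_B \right)$.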
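The variance and cross-population covariance of the rescaled variables $\varepsilon_1^A X_1^A / \alpha_A$ and $\varepsilon_1^B X_1^B / \alpha_B$ can be sanity-checked by simulation. The sketch below is only an illustration under assumptions that are not in the paper: Poisson-distributed $X_1^A$ and $X_1^B$, a split in which every user falls in A or B ($\alpha_B = 1 - \alpha_A$), and hypothetical variable names. It compares empirical moments with $\sigma_{XA}^2 / \alpha_A + (1 - \alpha_A) m_{XA}^2 / \alpha_A$ and $-m_{XA} m_{XB}$.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500_000
alpha_a = 0.3                  # hypothetical probability of being assigned to A
alpha_b = 1.0 - alpha_a        # assumption: every user is in A or in B

# Hypothetical per-user quantities, drawn independently of the split.
x_a = rng.poisson(2.0, size=n).astype(float)
x_b = rng.poisson(3.0, size=n).astype(float)
eps_a = (rng.random(n) < alpha_a).astype(float)   # 1 if the user is in A
eps_b = 1.0 - eps_a                               # so that eps_a * eps_b = 0

# Rescaled variables whose empirical means are averaged in the proof.
xt_a = eps_a * x_a / alpha_a
xt_b = eps_b * x_b / alpha_b

m_a, var_a = 2.0, 2.0          # mean and variance of Poisson(2)
m_b = 3.0                      # mean of Poisson(3)
var_formula = var_a / alpha_a + (1.0 - alpha_a) / alpha_a * m_a**2
cov_formula = -m_a * m_b

empirical_var = xt_a.var(ddof=1)
empirical_cov = np.cov(xt_a, xt_b)[0, 1]
print(f"variance  : empirical {empirical_var:.3f}, formula {var_formula:.3f}")
print(f"covariance: empirical {empirical_cov:.3f}, formula {cov_formula:.3f}")
```

The negative covariance reflects the fact that a user contributes to at most one of the two populations.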


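Proof [Proof of Proposition 5] Define a continuous function $g$ from $\mathbb{R}^4$ to $\mathbb{R}$ by

$$\forall (x_A, y_A, x_B, y_B) \in \mathbb{R}^4 , \quad g\left( (x_A, y_A, x_B, y_B)^T \right) \stackrel{\text{def}}{=} x_B - x_A = (-1, 0, 1, 0) \times (x_A, y_A, x_B, y_B)^T ,$$

so that

$$S_n^{X^B} - S_n^{X^A} - (m_{XB} - m_{XA}) = g \begin{pmatrix} S_n^{X^A} - m_{XA} \\ S_n^{Y^A} - m_{YA} \\ S_n^{X^B} - m_{XB} \\ S_n^{Y^B} - m_{YB} \end{pmatrix} .$$

Then, by the continuous mapping theorem and Theorem 4, $\sqrt{n} \left( S_n^{X^B} - S_n^{X^A} - (m_{XB} - m_{XA}) \right)$
converges in distribution to a normal random variable of mean 0 and variance

$$(-1, 0, 1, 0)\, \Sigma\left( X_1^A, Y_1^A, X_1^B, Y_1^B \right) (-1, 0, 1, 0)^T .$$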


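Before moving to the proofs of the ratio CLTs, we need two intermediate lemmas.

Lemma 14 Under Assumptions A3-4, we have

$$m_{XA} > 0 , \quad m_{YA} > 0 , \quad m_{XB} > 0 , \quad m_{YB} > 0 .$$

Proof According to Assumption A4, $X_1^A \geq 0$ almost surely, which implies that $m_{XA} = \mathbb{E}[X_1^A] \geq 0$.
Furthermore, by the Markov inequality, for any $n \geq 1$ we have:

$$\mathbb{P}\{X_1^A \geq 1/n\} \leq n\, m_{XA} .$$

If $m_{XA} = 0$ then for any $n \geq 1$, $\mathbb{P}\{X_1^A \geq 1/n\} = 0$ and thus $\mathbb{P}\{X_1^A > 0\} = 0$, which is in
contradiction with Assumption A4. The same argument applies to $m_{YA}$, $m_{XB}$ and $m_{YB}$. ■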


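Lemma 15 Under Assumptions A1-5 there exists a constant $c \in [0, 1)$ such that

$$\mathbb{P}\{S_n^{X^A} = 0\} \leq c^n , \quad \mathbb{P}\{S_n^{Y^A} = 0\} \leq c^n , \quad \mathbb{P}\{S_n^{X^B} = 0\} \leq c^n , \quad \mathbb{P}\{S_n^{Y^B} = 0\} \leq c^n .$$

Proof We have

$$\begin{aligned}
\mathbb{P}\{S_n^{X^A} = 0\} &= \mathbb{P}\left\{ \frac{\varepsilon_1^A X_1^A}{\alpha_A} = 0 \right\}^n , \\
&= \mathbb{P}\{\varepsilon_1^A X_1^A = 0\}^n , \quad \text{by Definition 1,} \\
&= \left[ 1 - \mathbb{P}\{\varepsilon_1^A X_1^A > 0\} \right]^n , \\
&= \left[ 1 - \mathbb{P}\{\varepsilon_1^A > 0, X_1^A > 0\} \right]^n , \\
&= \left[ 1 - \mathbb{P}\{\varepsilon_1^A > 0\}\, \mathbb{P}\{X_1^A > 0\} \right]^n , \quad \text{by Assumption A2,} \\
&= \left[ 1 - \alpha_A \mathbb{P}\{X_1^A > 0\} \right]^n , \quad \text{by Assumption A5,}
\end{aligned}$$

where $1 - \alpha_A \mathbb{P}\{X_1^A > 0\} \in [0, 1)$ by Assumption A4. The same steps applied to $S_n^{Y^A}$, $S_n^{X^B}$
and $S_n^{Y^B}$ achieve the proof by setting

$$c \stackrel{\text{def}}{=} 1 - \min\left[ \alpha_A \mathbb{P}\{X_1^A > 0\} , \alpha_A \mathbb{P}\{Y_1^A > 0\} , \alpha_B \mathbb{P}\{X_1^B > 0\} , \alpha_B \mathbb{P}\{Y_1^B > 0\} \right] .$$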
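As a small numerical illustration of the geometric bound, the sketch below evaluates $(1 - \alpha_A \mathbb{P}\{X_1^A > 0\})^n$ and the product $\sqrt{n}\, c^n$ that appears in the proof of Lemma 11. The values of $\alpha_A$ and of the probability of a positive outcome are hypothetical, and the function name is illustrative.

```python
import math

def zero_probability(alpha, p_positive, n):
    """(1 - alpha * P{X_1 > 0})^n, the probability that all n terms vanish."""
    return (1.0 - alpha * p_positive) ** n

alpha_a, p_click = 0.5, 0.02     # hypothetical values
for n in (100, 1_000, 10_000):
    p0 = zero_probability(alpha_a, p_click, n)
    # sqrt(n) * c^n is the quantity that must vanish in the proof of Lemma 11.
    print(n, p0, math.sqrt(n) * p0)
```

Even with a low probability of a positive outcome per user, the probability of an empty sum becomes negligible for the user counts typical of an AB-test.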


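Proof [Proof of Proposition 6] The proof is a direct application of Corollary 13 of Appendix
A with $X_n = S_n^{X^B}$ and $Y_n = S_n^{X^A}$. Its assumptions are all satisfied:

1. $m_{XA} \neq 0$ by Lemma 14,

2. According to the weak law of large numbers, $S_n^{X^A} \xrightarrow{\mathbb{P}} m_{XA}$,

3. $\mathbb{P}\{S_n^{X^A} = 0\} \leq c^n$ according to Lemma 15,

4. According to Theorem 4,

$$\sqrt{n} \begin{pmatrix} S_n^{X^B} - m_{XB} \\ S_n^{X^A} - m_{XA} \end{pmatrix} \xrightarrow{\mathcal{D}} \mathcal{N}\left( 0, \begin{pmatrix} \dfrac{\sigma_{XB}^2}{\alpha_B} + \dfrac{1 - \alpha_B}{\alpha_B} m_{XB}^2 & -m_{XA} m_{XB} \\ -m_{XA} m_{XB} & \dfrac{\sigma_{XA}^2}{\alpha_A} + \dfrac{1 - \alpha_A}{\alpha_A} m_{XA}^2 \end{pmatrix} \right) .$$

Corollary 13 states that

$$\sqrt{n} \left( \frac{S_n^{X^B}}{\varphi(S_n^{X^A})} - \frac{m_{XB}}{m_{XA}} \right) \xrightarrow{\mathcal{D}} \mathcal{N}\left( 0, P^T \begin{pmatrix} \dfrac{\sigma_{XB}^2}{\alpha_B} + \dfrac{1 - \alpha_B}{\alpha_B} m_{XB}^2 & -m_{XA} m_{XB} \\ -m_{XA} m_{XB} & \dfrac{\sigma_{XA}^2}{\alpha_A} + \dfrac{1 - \alpha_A}{\alpha_A} m_{XA}^2 \end{pmatrix} P \right) , \quad \text{where } P = \begin{pmatrix} \dfrac{1}{m_{XA}} \\ -\dfrac{m_{XB}}{m_{XA}^2} \end{pmatrix} .$$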

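Proof [Proof of Proposition 7] The proof is a direct application of Proposition 12 of Appendix
A with $X_n = S_n^{X^A}$, $Y_n = S_n^{Y^A}$, $X'_n = S_n^{X^B}$, and $Y'_n = S_n^{Y^B}$. Its assumptions are all
satisfied:

1. $m_{YA} \neq 0$ and $m_{YB} \neq 0$ by Lemma 14,

2. According to the weak law of large numbers, $S_n^{Y^A} \xrightarrow{\mathbb{P}} m_{YA}$ and $S_n^{Y^B} \xrightarrow{\mathbb{P}} m_{YB}$,

3. $\mathbb{P}\{S_n^{Y^A} = 0\} \leq c^n$ and $\mathbb{P}\{S_n^{Y^B} = 0\} \leq c^n$ by Lemma 15,

4. According to Theorem 4, the CLT condition is satisfied.

Proposition 12 states that

$$\sqrt{n} \begin{pmatrix} \dfrac{S_n^{X^A}}{\varphi(S_n^{Y^A})} - \dfrac{m_{XA}}{m_{YA}} \\ \dfrac{S_n^{X^B}}{\varphi(S_n^{Y^B})} - \dfrac{m_{XB}}{m_{YB}} \end{pmatrix} \xrightarrow{\mathcal{D}} \mathcal{N}\left( 0, P^T \Sigma P \right) ,$$

where $\Sigma = \Sigma\left( X_1^A, Y_1^A, X_1^B, Y_1^B \right)$ is defined in Theorem 4 and

$$P = \begin{pmatrix} \dfrac{1}{m_{YA}} & 0 \\ -\dfrac{m_{XA}}{m_{YA}^2} & 0 \\ 0 & \dfrac{1}{m_{YB}} \\ 0 & -\dfrac{m_{XB}}{m_{YB}^2} \end{pmatrix} .$$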


Proof [Proof of Proposition 8] The proof follows the same steps as the one of Proposition 5. ■

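Proof [Proof of Proposition 9] The proof is another application of Corollary 13 in Appendix
A with $X_n = S_n^{X^B} / \varphi(S_n^{Y^B})$ and $Y_n = S_n^{X^A} / \varphi(S_n^{Y^A})$, for which we check the assumptions:

1. $m_{XA} / m_{YA} \neq 0$ by Lemma 14,

2. By the weak law of large numbers, we have $S_n^{X^A} \xrightarrow{\mathbb{P}} m_{XA}$ and $S_n^{Y^A} \xrightarrow{\mathbb{P}} m_{YA}$. Then
by Lemma 10, $\varphi(S_n^{Y^A}) \xrightarrow{\mathbb{P}} m_{YA}$ and we can apply the continuous mapping theorem
to get $S_n^{X^A} / \varphi(S_n^{Y^A}) \xrightarrow{\mathbb{P}} m_{XA} / m_{YA}$,

3. According to Lemma 15, we have $\mathbb{P}\left\{ S_n^{X^A} / \varphi(S_n^{Y^A}) = 0 \right\} = \mathbb{P}\left\{ S_n^{X^A} = 0 \right\} \leq c^n$,

4. The central limit theorem is stated in Proposition 7.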